Processing math: 76%

Previous Up Next

Cool Assembly Language

Cool Assembly Language is a simplified RISC-style assembly language that is reminiscent of MIPS Assembly Language crossed with x86 Assembly Language. It also features typing aspects that may remind one of Java Bytecode.

A Cool Assembly Language program is a list of instructions. Each instruction may be preceded by any number of labels. Comments follow the standard Cool conventions. In addition, a semicolon ; functions like a double dash in that it marks the rest of that line as a comment. The Cool CPU is a load-store architecture with eight general purpose registers and three special-purpose registers. For simplicity, a machine word can hold either a 32-bit integer value or an entire raw string; regardless, all machine words have size one.

This document assumes that you already have some familiarity with assembly language, registers, and how CPUs operate. We first present a formal grammar and then explain the semantics. Only terms in typewriter font are part of the formal grammar. Text after -- is a comment. We use italics for non-terminals.

register ::= r0 -- general purpose register #0, often used as the accumulatorregister ::= r1 -- general purpose register #1register ::= r2register ::= r3register ::= r4register ::= r5register ::= r6register ::= r7register ::= sp -- stack pointer registerregister ::= fp -- frame pointer registerregister ::= ra -- return address pointer instruction ::= li register -- load immediateinstruction ::= mov register < register -- register-to-register copyinstruction ::= add register < register registerinstruction ::= sub register < register registerinstruction ::= mul register < register registerinstruction ::= div register < register registerinstruction ::= jmp label -- unconditional branchinstruction ::= bz register label -- branch if equal to zeroinstruction ::= bnz register label -- branch if not zeroinstruction ::= beq register register label -- branch if equalinstruction ::= blt register register label -- branch if less thaninstruction ::= ble register register label -- branch if less than or equal toinstruction ::= call label -- direct function callinstruction ::= call register -- register-indirect function callinstruction ::= return -- function returninstruction ::= push register -- push a value on the stackinstruction ::= pop register -- push a value off the stackinstruction ::= ld register < register [ integer ] -- load a value from memoryinstruction ::= st register [ integer ] < register -- store a value into memoryinstruction ::= la register < label -- load an address into a registerinstruction ::= alloc register register -- allocate memoryinstruction ::= constant integer -- lay out a compile-time constant in memoryinstruction ::= constant raw_string -- lay out a compile-time constant in memoryinstruction ::= constant label -- lay out a compile-time constant in memoryinstruction ::= syscall name -- request a service from the run-time systeminstruction ::= debug register -- debugging support: print register valueinstruction ::= trace -- toggle tracing

That's it, and the last two do not really count. We next describe the interpretation of these instructions in more detail.

The system calls available are:

That system calls correspond directly to internal predefined methods on Cool Int and String objects. The key difference is that the system calls work on raw values (i.e., machine-level ints and strings) and not on Cool Objects.

Cool CPU Simulator

The normal Cool compiler executable (e.g., cool.exe) also serves as a Cool CPU Simulator that executes Cool Assembly Language programs. Just pass file.clasm as an argument.

The simulator performs the following actions:

  1. Load the .clasm program into memory starting at address 1000. That is, if the first instruction in file.clasm is mov r1, r2, then memory location 1000 will hold the instruction mov r1, r2. If the second instruction in file.clasm is constant 55, then memory location 1001 will hold the integer 55.
  2. Set sp and fp to 2,000,000,000. Remember, the stack starts at high addresses and grows down.
  3. Search file.clasm for a label named start. The program counter is set to the address corresponding to that label. For example, if start: occurs just before the third instruction in file.clasm, then the program counter starts at 1002.
  4. Fetch the instruction pointed to by the program counter and execute it. Unless the instruction specifies otherwise, the program counter is incremented by one and the process repeats.
  5. When memory is allocated (e.g., by the alloc instruction), addresses starting from at least 20,000 are used.
  6. If more than 1000 call instructions are executed before any return instructions are executed (i.e., if there are more than 1000 calls on the stack), the simulator terminates and prints a stack overflow error.

The constant values listed above (1000; 20,000; 2,000,000,000) should not be counted on by your program, but are listed here to help with debugging. Addresses near 1000 hold program instructions or compile-time data (i.e., the code segment), addresses near 20,000 hold the heap, and addresses near two billion are on the stack.

Debugging

Debugging assembly language programs is notoriously difficult! While writing your code generator, you will spend quite a bit of time running generated Cool Assembly programs through the Cool CPU Simulator to see if they work. Often they will not. The Cool CPU Simulator has been designed with a large number of features to aid debugging. Basically none of these features are present in traditional assemblers, so you actually have a wealth of debugging support, but it will still be difficult.

Control Flow Graphs

The Cool reference compiler also includes options to produce control flow graphic visualizations in the style of the dotty tool from the Graphviz toolkit.

Passing the cfg option (with, for example, opt asm) produces method.dot, which can then be inspected via a number of tools. For example, this program:

class Main {
  main():Object {
    if (isvoid self) then  
      (new IO).out_string("cannot happen!\n")
    else 
      (new IO).out_string("hello, world!\n")
    fi 
  };
};

Might produce this control-flow graph:

While you do not have to match the reference compiler exactly, inspecting its control-flow graphs can help you debug your own code to create control-flow graphs.

Performance Model

As discussed above, the Cool reference compiler also includes a reference machine simulator to interpret Cool Assembly Language instructions. This simulator can be invoked directly by passing a \rm .cl \HY asm file to \rm cool.exe:

cool$ cat hello-world.cl
class Main {
  main():Object {
    (new IO).out_string("hello, world!\n")
  };
};
cool$ ./cool --asm hello-world.cl
cool$ ./cool hello-world.cl-asm 
hello, world!

The simulator can also give detailed performance information:

cool$ ./cool --profile hello-world.cl-asm 
hello, world!
PROFILE:           instructions =        107 @    1 =>        107
PROFILE:        pushes and pops =         29 @    1 =>         29
PROFILE:             cache hits =         22 @    0 =>          0
PROFILE:           cache misses =        570 @  100 =>      57000
PROFILE:     branch predictions =          0 @    0 =>          0
PROFILE:  branch mispredictions =         11 @   20 =>        220
PROFILE:        multiplications =          0 @   10 =>          0
PROFILE:              divisions =          0 @   40 =>          0
PROFILE:           system calls =          2 @ 1000 =>       2000
CYCLES: 59356

The execution time of a Cool Assembly Language program is measured in simulated instruction cycles. In general, each assembly instruction takes one cycle. Some instructions, such as system calls or memory operation, can cost many more cycles. The total cycle cost of a program is the sum of all of its component cycle costs.

In modern architectures, memory hierarchy effects (e.g., caching) and branch prediction are dominant factors in the execution speed of a program. To give you a flavor for what real-world code optimization is like, the Cool Simulator also simulates a cache and a branch predictor.

The Cool Simulator features a 64-word least-recently-used fully associative combined instruction and data cache. It also uses a static backward = taken, forward = not taken branch prediction scheme.

We now discuss each of the performance components in turn:

  1. \bf instructions. Each Cool Assembly Language instruction executed costs at least one cycle. This represents the time taken to fetch and decode the instruction, as well as to shepherd it through the pipeline. Instructions such as \rm li, \rm mov and \rm add take one cycle.
  2. \bf pushes\ and\ pops. Such \rm push and \rm pop involve both a load/store and also an add/sub, each costs an additional cycle (for a total of two). (\rm push and \rm pop can also incur cache miss penalties; see below.)
  3. \bf cache\ hits\ \&\ misses. In modern computers, the CPU executes much faster than main memory: hundreds of "normal" instructions can be executed in the time it takes to fetch one value from memory. To mitigate this problem, a small number of values are placed in expensive, high-speed memory near the CPU. This small, fast memory stores recently-used values and is known as a \bf cache. The Cool Simulator features a 64-word fully-associated cache: the values associated with 64 addresses can be accessed rapidly. If a memory read or write accesses an address that is in the cache, the instruction completes immediately with no extra cost. If a memory read or write accesses an address that is not in the cache, it costs 100 cycles while that value is read in from main memory. If there is no room in the cache to hold that new address's value, the address that has been touched (read or written) least recently is evicted and the new address/value is put in its place. Typical reasons for cache misses include compulsory, capacity and conflict. Note that the cache and cache miss penalty apply to every access to memory. This Includes:
  4. \bf branch\ prediction\ \&\ misprediction. In a modern pipelined CPU, the next instruction is fetched before the current instruction has completed. This means that the CPU needs to know the address of the next instruction as early as possible. For a conditional branch, that may be difficult: the CPU may have to wait until the comparison is complete to determine if the next instruction will be at \rm pc+1 or \rm label. Modern CPUs optimistically "guess" or "predict" that a branch will go one way or the other and then rollback instructions if they are wrong. A correctly-predicted branch costs nothing; a mispredicted branch costs 20 cycles. The following instructions are related to this cost:
  5. \bf multiplication\ \&\ division. Integer multiplication and division take longer on most architectures than addition and subtraction. In the Cool Simulator, \rm mul costs an extra 10 cycles and \rm div costs an extra 40.
  6. \bf system\ calls. A system call involves trapping to the operating system, switching CPU protection contexts, putting the old process on the scheduling queue, handling the operation, rescheduling the new process, and switching CPU protection contexts again. System calls take forever. In the Cool Simulator, each \rm syscall instruction takes 1000 extra cycles.

This cost model involves realistic components but potentially unrealistic values (e.g., a modern CPU would have a much larger non-associative cache, and also a much larger cache miss cost). If you're interested in that sort of performance modeling, take a graduate class in computer architecture. You should know that this CPU performance model is one of the most realistic that I've seen for a compiler optimization project in terms of the issues that it forces you to address.

The reference compiler includes a simple reference peephole optimizer, as well as a few optimizations backed by dataflow analyses (liveness, reaching definitions, constant folding) and register allocation enabled via the \rm --opt flag. You can use it to get an idea for how to get started (but note that we are evil and strip all comments from the optimized output).

yuki:~/src/cool$ ./cool --opt --asm hello-world.cl
yuki:~/src/cool$ ./cool --profile hello-world.cl-asm 
hello, world!
PROFILE:           instructions =         79 @    1 =>         79
PROFILE:        pushes and pops =         23 @    1 =>         23
PROFILE:             cache hits =         15 @    0 =>          0
PROFILE:           cache misses =        513 @  100 =>      51300
PROFILE:     branch predictions =          2 @    0 =>          0
PROFILE:  branch mispredictions =          7 @   20 =>        140
PROFILE:        multiplications =          0 @   10 =>          0
PROFILE:              divisions =          0 @   40 =>          0
PROFILE:           system calls =          2 @ 1000 =>       2000
CYCLES: 53542

For the \rm hello-world program, this optimizer reduces the cycle cost from 59356 to 53453 -- a 10% improvement. If you are writing an optimizer, you will want to do at least as well as the reference, averaged over many input programs. Notably, you'll probably want to implement much more than the required dead code elimination optimization.

Previous Up Next